792dd774336314c3c27a04bb260cf2cf-Supplemental.pdf

Neural Information Processing Systems

Finally, we train our model for 8 hours on a single V100 GPU. We provide an illustration of our weakly supervised phrase grounding model in Figure 4b (this supplemental). Specifically, we create context-preserving negative captions for an image by substituting a noun in its original caption with negative nouns that are sampled from a pretrained BERT [17] model. For example, in the case where only one cross-attention layer is used, adding the sentence-level contrastive loss leads to a 2.5% improvement in the R@1 accuracy. These videos contain transcribed narrations that are either uploaded manually by users or are the output of an automatic speech recognition (ASR) system.
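As an illustration of the negative-caption construction described above, the following Python sketch masks a chosen noun in a caption and samples replacement nouns from a pretrained BERT masked language model via Hugging Face Transformers. The helper name `make_negative_captions` and the simple filtering rule are assumptions for illustration only; the paper's exact sampling and filtering procedure may differ.

```python
# Illustrative sketch (not the authors' exact code) of building
# context-preserving negative captions: mask one noun in the caption
# and let a pretrained BERT masked LM propose substitute nouns.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def make_negative_captions(caption, noun, top_k=5):
    """Replace `noun` in `caption` with BERT-proposed alternatives."""
    # Simplification: naive substring replacement of the first occurrence.
    masked = caption.replace(noun, fill_mask.tokenizer.mask_token, 1)
    candidates = fill_mask(masked, top_k=top_k)
    # Keep only substitutions that actually change the noun, so the
    # surrounding context of the caption is preserved.
    return [c["sequence"] for c in candidates
            if c["token_str"].strip().lower() != noun.lower()]

# Example: negatives for "a dog catches a frisbee" with "dog" swapped out.
print(make_negative_captions("a dog catches a frisbee", "dog"))
```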




Export Reviews, Discussions, Author Feedback and Meta-Reviews

Neural Information Processing Systems

First provide a summary of the paper, and then address the following criteria: quality, clarity, originality, and significance. This paper addresses the problem of object detection, in particular the challenge of obtaining bounding-box annotations at a scale comparable to that at which category labels exist for object categorization. The authors side-step this challenge by proposing to adapt object classifiers for the detection task. Their algorithm is fairly simple and straightforward, which is not a bad thing in itself. Their experimental protocol trains on 100 categories (with both category labels and bounding boxes) and tests on 100 held-out categories.


A Supplementary

Neural Information Processing Systems

In this supplementary material, we provide the following additions to the main submission (A.1). We use ReLU as the activation function. We provide an illustration of our weakly supervised phrase grounding model in Figure 4b (this supplemental). We also describe how to incorporate our proposed CoMMA into the model of Gupta et al. Finally, the sentence loss is weighted by a hyperparameter.
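For concreteness, here is a minimal sketch of one plausible sentence-level contrastive loss over a positive caption and K context-preserving negative captions, together with the hyperparameter weighting mentioned above. The function name, the InfoNCE-style formulation, and `lambda_sent` are assumptions for illustration, not the paper's exact objective.

```python
# Minimal sketch (assumed formulation) of a sentence-level contrastive loss:
# the image embedding should score higher with the positive caption than
# with any sampled negative caption.
import torch
import torch.nn.functional as F

def sentence_contrastive_loss(img_emb, pos_emb, neg_embs, temperature=0.07):
    """img_emb: (D,), pos_emb: (D,), neg_embs: (K, D)."""
    img_emb = F.normalize(img_emb, dim=-1)
    caps = F.normalize(torch.cat([pos_emb.unsqueeze(0), neg_embs], dim=0), dim=-1)
    logits = caps @ img_emb / temperature        # similarity to each caption, (K+1,)
    target = torch.tensor(0)                     # positive caption sits at index 0
    return F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))

# The sentence loss is then scaled by a hyperparameter before being added
# to the rest of the training objective, e.g.:
# loss = base_loss + lambda_sent * sentence_contrastive_loss(img, pos, negs)
```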


Dur360BEV: A Real-world 360-degree Single Camera Dataset and Benchmark for Bird-Eye View Mapping in Autonomous Driving

E, Wenke, Yuan, Chao, Li, Li, Sun, Yixin, Gaus, Yona Falinie A., Atapour-Abarghouei, Amir, Breckon, Toby P.

arXiv.org Artificial Intelligence

We present Dur360BEV, a novel spherical camera autonomous driving dataset equipped with a high-resolution 128-channel 3D LiDAR and an RTK-refined GNSS/INS system, along with a benchmark architecture designed to generate Bird-Eye-View (BEV) maps using only a single spherical camera. This dataset and benchmark address the challenges of BEV generation in autonomous driving, particularly by reducing hardware complexity through the use of a single 360-degree camera instead of multiple perspective cameras. Within our benchmark architecture, we propose a novel spherical-image-to-BEV module that leverages spherical imagery and a refined sampling strategy to project features from 2D to 3D. Our approach also includes an innovative application of focal loss, specifically adapted to address the extreme class imbalance often encountered in BEV segmentation tasks, which demonstrates improved segmentation performance on the Dur360BEV dataset. The results show that our benchmark not only simplifies the sensor setup but also achieves competitive performance.
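To make the class-imbalance point concrete, the sketch below shows a standard binary focal loss applied to BEV segmentation grids: easy (mostly background) cells are down-weighted so training focuses on the rare foreground cells. The function name and the parameter values are illustrative assumptions; the paper's specific adaptation of focal loss for Dur360BEV may differ.

```python
# Hedged sketch: standard binary focal loss over BEV grids, illustrating how
# focal loss counteracts extreme class imbalance in BEV segmentation.
import torch
import torch.nn.functional as F

def bev_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """logits, targets: (B, C, H, W) BEV grids; targets are 0/1 masks."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)          # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    loss = alpha_t * (1 - p_t) ** gamma * bce            # down-weight easy cells
    return loss.mean()
```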